Use case example

This page shows one complete workflow from login to production execution.

Prerequisites

Before starting, make sure you can reach the DGX over SSH with your account (step 1 shows the hostname).

Goal

Run train.py, first in an interactive GPU session, then in batch mode on the prod10 partition.
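Here, train.py stands for your own training script. Purely as a placeholder (this file is not provided by the cluster), a minimal stand-in using only numpy, installed in step 2, could look like:

```python
# Hypothetical placeholder for train.py: fits a linear model with
# least squares as a stand-in for a real training loop.
import numpy as np

def fit(X, y):
    """Return the least-squares weight vector w for y ≈ X @ w."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def main():
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))        # synthetic inputs
    w_true = np.array([1.0, -2.0, 0.5])  # ground-truth weights
    y = X @ w_true                       # noiseless targets
    w = fit(X, y)
    print("learned weights:", np.round(w, 3))

if __name__ == "__main__":
    main()
```

Any script that runs on the command line works the same way in the steps below.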

1) Connect and create a project folder

ssh dgx
# if you do not use an SSH alias:
# ssh <username>@hubia-dgx.centralesupelec.fr
mkdir -p ~/my_project
cd ~/my_project

2) Create a Python virtual environment

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install numpy torch

Quick check:

python -c "import numpy; print(numpy.__version__)"

3) Test on GPU with an interactive session

srun -p interactive10 --time=00:30:00 --pty bash

Inside the interactive shell:

cd ~/my_project
source venv/bin/activate
python train.py
exit

4) Prepare a batch script

Use the default template:

cd ~/my_project
cp ~/slurm-prod10.sbatch ./job.sbatch
nano job.sbatch

Set at least:

  • the job name (#SBATCH --job-name=...)
  • the time limit (#SBATCH --time=...)
  • the Python command to run (python train.py)
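As a sketch only, a minimal job.sbatch could look like the following; the partition name, time limit, and output pattern are illustrative assumptions, and the real template on the DGX may already set these plus additional options such as GPU requests:

```shell
#!/bin/bash
#SBATCH --job-name=my_training   # name shown in squeue
#SBATCH --partition=prod10       # partition used in this workflow (assumption)
#SBATCH --time=02:00:00          # walltime limit (assumption: 2 h)
#SBATCH --output=slurm-%j.out    # log file, %j expands to the job id

cd ~/my_project
source venv/bin/activate
python train.py
```

Prefer editing the cluster's own template over writing one from scratch, since it encodes site-specific defaults.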

5) Submit and monitor

sbatch job.sbatch
squeue -u $USER

Inspect one job:

scontrol show job <jobid>
sacct -j <jobid> --format=JobID,State,Elapsed,ExitCode

Read logs:

tail -n 100 slurm-<jobid>.out

Cancel if needed:

scancel <jobid>
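The monitoring commands above can be wrapped in a small helper that polls squeue until the job leaves the queue; the function name and the 30-second interval are arbitrary choices, not cluster conventions:

```shell
# Hypothetical helper: block until a Slurm job is no longer queued or running.
wait_for_job() {
  local jobid="$1"
  # squeue -h prints nothing once the job has left the queue
  while squeue -j "$jobid" -h 2>/dev/null | grep -q .; do
    sleep 30
  done
  echo "job $jobid finished"
}

# usage: wait_for_job <jobid> && tail -n 100 slurm-<jobid>.out
```

This is convenient for chaining a log inspection right after job completion.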

6) Scale up only when needed

If the model does not fit in prod10 (10 GB of VRAM), move to a larger partition:

  • prod40 (40 GB VRAM)
  • prod80 (80 GB VRAM)

Keep the same workflow; only the partition, time limit, and script content change.

7) Optional: work from VS Code

You can use VS Code Remote-SSH to edit files on the DGX.

If the extension gets stuck, try:

  • run "Remote-SSH: Uninstall VS Code Server from Host" from the Command Palette
  • reconnect to the host

Next references